26 research outputs found

    Structural Equation Modeling and simultaneous clustering through the Partial Least Squares algorithm

    The identification of different homogeneous groups of observations and their appropriate analysis in PLS-SEM has become a critical issue in many application fields. Usually, both SEM and PLS-SEM assume the homogeneity of all units on which the model is estimated, and the segmentation approaches present in the literature consist of estimating separate models for each segment of statistical units, where the segments have been obtained by assigning the units to a priori defined segments. However, these approaches are not fully acceptable because no causal structure among the variables is postulated. In other words, a modeling approach should be used in which the obtained clusters are homogeneous with respect to the structural causal relationships. In this paper, a new methodology for simultaneous non-hierarchical clustering and PLS-SEM is proposed. This methodology is motivated by the fact that the sequential approach of applying first SEM or PLS-SEM and then a clustering algorithm such as K-means on the latent scores of the SEM/PLS-SEM may fail to find the correct clustering structure existing in the data. A simulation study and an application on real data are included to evaluate the performance of the proposed methodology.

    Dimensionality reduction and simultaneous classification approaches for complex data: methods and applications

    Statistical learning (SL) is the study of the generalizable extraction of knowledge from data (Friedman et al. 2001). Learning is used when human expertise does not exist, when humans are unable to explain their expertise, when the solution changes over time, or when the solution needs to be adapted to particular cases. The principal algorithms used in SL are classified as: (i) supervised learning (e.g. regression and classification), which is trained on labelled examples, i.e., inputs for which the desired output is known. In other words, a supervised learning algorithm attempts to generalize a function or mapping from inputs to outputs, which can then be used speculatively to generate an output for previously unseen inputs; (ii) unsupervised learning (e.g. association and clustering), which operates on unlabelled examples, i.e., inputs for which the desired output is unknown; in this case the objective is to discover structure in the data (e.g. through a cluster analysis), not to generalize a mapping from inputs to outputs; (iii) semi-supervised learning, which combines both labelled and unlabelled examples to generate an appropriate function or classifier. In a multidimensional context, when the number of variables is very large, or when it is believed that some of them do not contribute much to identifying the group structure in the data set, researchers apply a continuous model for dimensionality reduction, such as principal component analysis, factor analysis or correspondence analysis, and then a discrete clustering model, such as K-means or mixture models, on the computed object scores. This approach is called tandem analysis (TA) by Arabie & Hubert (1994). However, De Sarbo et al. (1990) and De Soete & Carroll (1994) warn against this approach, because the methods for dimension reduction may identify dimensions that do not necessarily contribute much to revealing the group structure in the data and that, on the contrary, may obscure or mask the group structure that could exist in the data.
    A solution to this problem is given by a methodology that includes the simultaneous detection of factors and clusters on the computed scores. In the case of continuous data, many alternative methods combining cluster analysis and the search for a reduced set of factors have been proposed, focusing on factorial methods, multidimensional scaling or unfolding analysis and clustering (e.g., Heiser 1993, De Soete & Heiser 1993). De Soete & Carroll (1994) proposed an alternative to the K-means procedure, named reduced K-means (RKM), which turned out to be equivalent to the previously proposed projection pursuit clustering (PPC) (Bolton & Krzanowski 2012). RKM simultaneously searches for a clustering of objects, based on the K-means criterion (MacQueen 1967), and a dimensionality reduction of the variables, based on principal component analysis (PCA). However, this approach may fail to recover the clustering of objects when the data contain much variance in directions orthogonal to the subspace of the data in which the clusters reside (Timmerman et al. 2010). To solve this problem, Vichi & Kiers (2001) proposed the factorial K-means (FKM) model. FKM combines K-means cluster analysis with PCA, thereby finding the subspace that best represents the clustering structure in the data. In other words, FKM works in the reduced space and simultaneously searches for the best partition of the objects, based on the K-means criterion, and for the best reduced orthogonal space representing that partition, based on PCA. When categorical variables are observed, TA corresponds to applying first multiple correspondence analysis (MCA) and subsequently K-means clustering on the resulting factors. Hwang et al. (2007) proposed an extension of MCA that takes into account cluster-level heterogeneity in respondents' preferences/choices. The method involves combining MCA and K-means in a unified framework.
    The former is used for uncovering a low-dimensional space of multivariate categorical variables, while the latter is used for identifying relatively homogeneous clusters of respondents. In recent years, the dimensionality reduction problem has also become well known in other statistical contexts, such as structural equation modeling (SEM). In fact, in a wide range of SEM applications, the assumption that data are collected from a single homogeneous population is often unrealistic, and the identification of different groups (clusters) of observations constitutes a critical issue in many fields. Following this research idea, in this doctoral thesis we present a review of the most recent statistical models used to solve the dimensionality problem discussed above. In particular, in the first chapter we show an application to hyperspectral data classification using the most widely used discriminant functions for the high-dimensionality problem, e.g., partial least squares discriminant analysis (PLS-DA); in the second chapter we present the multiple correspondence K-means (MCKM) model proposed by Fordellone & Vichi (2017), which simultaneously identifies the best partition of the N objects described by the best orthogonal linear combination of categorical variables according to a single objective function; finally, in the third chapter we present the partial least squares structural equation modeling K-means (PLS-SEM-KM) proposed by Fordellone & Vichi (2018), which simultaneously identifies the best partition of the N objects described by the best causal relationship among the latent constructs.
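
    The sequential (tandem analysis) procedure that this abstract warns against can be illustrated with a minimal sketch. It is not code from the thesis: it assumes NumPy, uses PCA via the singular value decomposition as the dimension-reduction step and plain Lloyd iterations as the K-means step, and the names `pca_scores`, `kmeans` and `tandem_analysis` are hypothetical.

```python
import numpy as np


def pca_scores(X, n_components):
    """Project column-centred data onto its first principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


def kmeans(Y, k, n_iter=100, seed=0):
    """Plain Lloyd iterations (illustrative, single random start)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids with k distinct observations.
    centroids = Y[rng.choice(len(Y), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance of every object to every centroid.
        dist = ((Y[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Recompute centroids; an empty cluster keeps its old centroid.
        new_centroids = np.array([
            Y[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


def tandem_analysis(X, n_components, k, seed=0):
    """Step 1: reduce dimension with PCA. Step 2: cluster the scores."""
    scores = pca_scores(X, n_components)
    labels, _ = kmeans(scores, k, seed=seed)
    return scores, labels
```

    Because the components in step 1 are chosen to maximise variance without reference to the partition found in step 2, they need not be the directions that separate the groups; this is the weakness that simultaneous methods such as RKM and FKM are meant to address.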

    Partial least squares discriminant analysis: A dimensionality reduction method to classify hyperspectral data

    The recent development of more sophisticated spectroscopic methods allows acquisition of high dimensional datasets from which valuable information may be extracted using multivariate statistical analyses, such as dimensionality reduction and automatic classification (supervised and unsupervised). In this work, a supervised classification through a partial least squares discriminant analysis (PLS-DA) is performed on the hyperspectral data. The obtained results are compared with those obtained by the most commonly used classification approaches.

    Multiple Correspondence K-Means: Simultaneous Versus Sequential Approach for Dimension Reduction and Clustering

    In this work, a discrete model for clustering and a continuous factorial one for dimension reduction are simultaneously fitted to categorical data, with the aim of identifying the best partition of the objects, described by the best orthogonal linear combinations of the factors, according to the least-squares criterion. This new methodology, named multiple correspondence k-means, is a useful alternative to Tandem Analysis in the case of categorical data. The approach thus has a double objective: data reduction and synthesis, simultaneously in the direction of the rows and of the columns of the data matrix.

    Comments about the use of PLS path modeling in building a Job Quality Composite Indicator

    A composite indicator is formed when elementary indicators are compiled into a single index, on the basis of an underlying model of the multidimensional concept that is being measured. The PLS path modeling allows the estimation of composite indicators and the measurement model could be expressed both as formative and reflective. In this paper we construct a composite indicator of job quality using the PLS path modeling approach and compare results obtained by the formative and the reflective measurement models of the general concept. We observe that the two approaches can give different results. Consequently, we give some suggestions in order to estimate stable and reliable models.

    Prototype definition through consensus analysis between fuzzy c-means and archetypal analysis

    The general aim of cluster analysis is to build prototypes, or typologies of units that present similar characteristics. In this paper we propose an alternative approach based on consensus analysis of two different clustering methods to suitably obtain prototypes. The clustering methods used are fuzzy c-means (centre approach) and archetypal analysis (extreme approach). The consensus clustering is used to assess the correspondence between the clustering solutions obtained
